Abstract A generalised notion of exponential families is introduced. It is based on the variational principle, borrowed from statistical physics. It is shown that inequivalent generalised entropy functions lead to distinct generalised exponential families. The well-known result that the inequality of Cramér and Rao becomes an equality in the case of an exponential family can be generalised. However, this requires the introduction of escort probabilities.

1 Introduction

Generalised entropy functions have been studied intensively in the second half of the past century. They have been called quasi-entropies in [10]. Every entropy function is in fact minus a relative entropy, also called a divergence. It is relative to some reference measure $c$. Consider the f-divergence [3]

$$I(p\|c) = \sum_a c_a\, f(p_a/c_a), \tag{1}$$

with $f(u)$ a convex function defined for $u > 0$ and strictly convex at $u = 1$. It is minus the entropy of $p$, relative to $c$. Taking $c_a = 1$ for all $a$ and $f(u) = u\ln u$ one obtains the Boltzmann-Gibbs-Shannon entropy

$$I(p) = -\sum_a p_a \ln p_a. \tag{2}$$

Note that throughout the paper discrete probabilities are considered, with events $a$ belonging to a finite or countable alphabet $A$.

Recent interest in these generalised entropies within statistical physics goes back to the introduction by Tsallis [14] of the q-entropy

$$I_q(p) = \frac{1}{1-q}\Big(\sum_a p_a^q - 1\Big), \tag{3}$$

with $q > 0$. In the limit $q \to 1$ it converges to (2). It has been studied before in the mathematics literature by Havrda and Charvát [5], and by Daróczy [3]. Investigations within the physics community have led to some interesting developments. One of them is the introduction of deformed logarithmic and exponential functions [15, 6]; see Section 13. They have been very useful to generalise common concepts, like that of an exponential family or of a Gaussian distribution. They also helped to clarify the pitfalls of the generalisation process. One of the surprises is the necessity to introduce escort probability functions [17]; see Section 11. In a series of papers, including [3 E], the present author has elaborated a formalism based on deformed logarithms. In the present work, it is shown that slightly more general results are obtained when abandoning these deformed logarithms.

In Sections 2 to 6 the maximum entropy principle and the variational principle are discussed in the context of generalised entropies. In particular, a characterisation of the maximising probability distributions is given. This is used in Section 7 to define a generalised exponential family. In Section 8 it is shown that the intersection of distinct generalised exponential families is empty and that there exists a one-to-one relation with generalised entropy functions. Sections 9 to 12 discuss geometric aspects, starting with concepts from thermodynamics and introducing escort families and a generalised Fisher information matrix. Sections 13 and 14 discuss non-extensive thermostatistics and the percolation problem as examples of the generalised formalism. The paper ends with a short discussion in Section 15.

2 Generalised entropies 

Let us fix some further notations. The space of probability distributions is denoted $\mathcal{M}_1^+(A)$. Expectation values are denoted $\langle p, X\rangle = \sum_{a\in A} p_a X(a)$. Here we follow the physics tradition to put the elements of the dual space at the l.h.s.

It is rather common to define a generalised entropy as any function $I(p)$ of the form

$$I(p) = \sum_{a\in A} h(p_a), \tag{4}$$

where $h(u)$ is a continuous strictly concave function, defined on $[0,1]$, which vanishes when $u = 0$ or $u = 1$. This is a special case of minus the f-divergence (1), with weights $c_a = 1$. The entropy function $I(p)$ is defined for any $p \in \mathcal{M}_1^+(A)$ and has values in $[0,+\infty]$. In the present paper it is allowed that the function $h(u)$ is stochastic, this means, depends also on $a$ in $A$. But for convenience of notation, this dependence will not be made explicit.

Throughout the paper it is assumed that the derivative

$$\frac{dh}{du} = -f(u) \tag{5}$$

exists on the interval $(0,1)$ and defines a continuous function on the half-open interval $(0,1]$. Because $h(u)$ is strictly concave, $f(u)$ is strictly increasing. Note that it is allowed to diverge to $-\infty$ at $u = 0$. This is indeed the case when $h(u) = -u\ln u$ and $f(u) = 1 + \ln u$.

The function $f(u)$ can be used to rewrite the entropy $I(p)$ as

$$I(p) = \sum_{a\in A} h(p_a) = -\sum_{a\in A}\int_0^{p_a} du\, f(u) = -\sum_{a\in A} p_a\int_0^1 dv\, f(p_a v). \tag{6}$$

Note that the latter expression implies that

$$I(p) \ge -\sum_{a\in A} p_a f(p_a). \tag{7}$$

The standard definition of the Bregman divergence [2] reads

$$D(p\|q) = I(q) - I(p) - \sum_{a\in A}(p_a - q_a) f(q_a). \tag{8}$$

In the case that $f(u)$ diverges at $u = 0$ it is only well defined when $q_a = 0$ implies $p_a = 0$. It is a convex function of the first argument. Note that one can write

$$D(p\|q) = \sum_{a\in A}\int_{q_a}^{p_a} du\,[f(u) - f(q_a)]. \tag{9}$$

From the latter expression it is immediately clear that $D(p\|q) \ge 0$, with equality if and only if $p = q$.
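
As a quick numerical illustration of (4), (7) and (8), the following minimal Python sketch evaluates the entropy and the Bregman divergence for the Boltzmann-Gibbs-Shannon choice h(u) = -u ln u, f(u) = 1 + ln u. The function names and the two test distributions are illustrative assumptions, not part of the formalism. In this special case (8) should reduce to the Kullback-Leibler divergence, the sum over a of p_a ln(p_a/q_a).

```python
import math

def h_bgs(u):
    """h(u) = -u ln u, with h(0) = 0 by continuity."""
    return 0.0 if u == 0.0 else -u * math.log(u)

def f_bgs(u):
    """f(u) = -dh/du = 1 + ln u, defined for u > 0."""
    return 1.0 + math.log(u)

def entropy(p, h=h_bgs):
    """Generalised entropy I(p) = sum_a h(p_a), eq. (4)."""
    return sum(h(pa) for pa in p)

def bregman(p, q, h=h_bgs, f=f_bgs):
    """Bregman divergence D(p||q), eq. (8); here q_a > 0 is required."""
    linear = sum((pa - qa) * f(qa) for pa, qa in zip(p, q))
    return entropy(q, h) - entropy(p, h) - linear

# Two arbitrary test distributions on a three-letter alphabet (assumed data).
p = [0.5, 0.3, 0.2]
q = [0.2, 0.2, 0.6]

d_pq = bregman(p, q)
kl = sum(pa * math.log(pa / qa) for pa, qa in zip(p, q))
bound = -sum(pa * f_bgs(pa) for pa in p)  # right-hand side of (7)
```

The quantities `d_pq` and `kl` should agree up to rounding, and `entropy(p)` should not fall below `bound`.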



3 Maximum entropy principle 

Let be given a finite number of real functions $H_1(a), H_2(a), \cdots, H_n(a)$. Assume they are bounded from below. In a physical context these functions may be called Hamiltonians. The maximum entropy problem deals with finding the probability distribution $p$ that maximises $I(p)$ under the constraint that the expectation values of the Hamiltonians $H_j$ attain given values $U_j$, called energies. Introduce the notation

$$\mathcal{P}_U = \{p \in \mathcal{M}_1^+ : \langle p, H_j\rangle = U_j,\ j = 1,2,\cdots,n\}. \tag{10}$$

Then one looks for the probability distribution $p \in \mathcal{P}_U$ which maximises $I(p)$.

Definition 1 A probability distribution $p^* \in \mathcal{P}_U$ is said to satisfy the maximum entropy principle if it satisfies

$$I(p) \le I(p^*) < +\infty \quad\text{for all } p \in \mathcal{P}_U. \tag{11}$$

In what follows a stronger condition is needed. It was introduced some 40 years ago [11] (see Theorem 7.4.1 of [12]) and is in fact a stability criterion.

Definition 2 A probability distribution $p^*$ is said to satisfy the variational principle if there exist parameters $\theta_1, \theta_2, \cdots, \theta_n$ such that

$$+\infty > I(p^*) - \sum_{j=1}^n\theta_j\langle p^*, H_j\rangle \ge I(p) - \sum_{j=1}^n\theta_j\langle p, H_j\rangle \quad\text{for all } p \in \mathcal{M}_1^+. \tag{12}$$

In statistical physics, a probability distribution satisfying the variational principle is called an equilibrium state.



4 Lagrange multipliers 

A popular way to solve the maximum entropy problem is by the introduction of Lagrange parameters. However, a difficulty arises, known as the cutoff problem. It is indeed possible that some of the probabilities $p_a$ of the optimising probability distribution vanish. Let us see how this problem arises. The Lagrangean reads

$$\mathcal{L} = I(p) - \alpha\sum_{a\in A} p_a - \sum_{j=1}^n\theta_j\langle p, H_j\rangle. \tag{13}$$

Here, $\alpha$ is the parameter introduced to fix the normalisation condition $\sum_{a\in A} p_a = 1$; the $\theta_j$ are introduced to cope with the constraints (10). Variation of $\mathcal{L}$ w.r.t. the $p_a$ yields

$$f(p_a) = -\alpha - \sum_{j=1}^n\theta_j H_j(a). \tag{14}$$

The problem that can arise is that it may well happen that the r.h.s. of this expression does not belong to the range of the function $f(u)$. This situation is particularly likely to occur when $f(u)$ does not tend to $-\infty$ when $u$ tends to 0. If the r.h.s. is in the range of $f(u)$ then $p_a$ is determined uniquely by (14) because of the assumption that $f(u)$ is a strictly increasing function.

The above problem is well known in optimisation theory. Because the constraints, defining $\mathcal{P}_U$, are affine, the set $\mathcal{P}_U$ forms a simplex. Its faces are obtained by putting some of the probabilities $p_a$ equal to zero. Because the entropy function $I(p)$ is concave it attains its maximum within one of these faces. This observation leads to the ansatz that the probability distribution $p$, which maximises $I(p)$ with $p$ in $\mathcal{P}_U$, if it exists, is determined by a subset $A_0 = \{a \in A : p_a = 0\}$, and by the values of the parameters $\alpha$ and $\theta_j$, which determine the remaining probabilities via (14). Let us now try to prove this statement.
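
Before turning to the proofs, the simplest instance of (14) can be checked numerically. The sketch below assumes the Boltzmann-Gibbs-Shannon case f(u) = 1 + ln u with a single Hamiltonian; the energy levels `H`, the value of `theta` and the random sampling are illustrative assumptions. Solving (14) gives p_a = exp(-1 - alpha - theta H(a)), with alpha fixed by normalisation, and the code checks by sampling that no other distribution attains a larger value of I(p) - theta <p, H>, as required by (12).

```python
import math
import random

H = [0.0, 1.0, 2.5, 4.0]   # hypothetical Hamiltonian on a four-letter alphabet
theta = 0.7                # hypothetical parameter value

# Solve (14) for f(u) = 1 + ln u: p_a = exp(-1 - alpha - theta*H(a)).
Z = sum(math.exp(-theta * Ha) for Ha in H)
alpha = math.log(Z) - 1.0  # normalisation fixes alpha
p_star = [math.exp(-1.0 - alpha - theta * Ha) for Ha in H]

def functional(p):
    """I(p) - theta <p, H>, the quantity compared in (12)."""
    ent = -sum(pa * math.log(pa) for pa in p if pa > 0.0)
    return ent - theta * sum(pa * Ha for pa, Ha in zip(p, H))

random.seed(0)
best = functional(p_star)
gaps = []
for _ in range(1000):
    w = [random.random() for _ in H]
    s = sum(w)
    p = [x / s for x in w]            # a random point of the simplex
    gaps.append(best - functional(p))  # should never be negative
```

In this case `best` coincides with ln Z, and each entry of `gaps` is the Bregman divergence D(p||p*) of (19).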
5 Characterisation

Let us first consider the more familiar situation that $f(0) = -\infty$.

Lemma 1 Assume $f(0) = -\infty$. Let $p^* \in \mathcal{M}_1^+$ satisfy the variational principle. Then $p^*_a > 0$ holds for all $a \in A$.

Proof

The contrapositive is proved.

Because of the normalisation, there exists at least one $a \in A$ for which $p^*_a > 0$. Assume $b \in A$ such that $p^*_b = 0$. Let us show that this implies that $p^*$ does not satisfy the variational principle.

Fix $0 < \epsilon \ll 1$. Introduce a new probability distribution $p$ which coincides with $p^*$ except that

$$p_a = (1-\epsilon)p^*_a \quad\text{and}\quad p_b = \epsilon p^*_a. \tag{15}$$

Let

$$M(\epsilon) = I(p) - \sum_{j=1}^n\theta_j\langle p, H_j\rangle. \tag{16}$$

Then one has

$$\frac{dM}{d\epsilon} = p^*_a f((1-\epsilon)p^*_a) - p^*_a f(\epsilon p^*_a) + p^*_a\sum_{j=1}^n\theta_j\,[H_j(a) - H_j(b)]. \tag{17}$$

From the assumption $f(0) = -\infty$ then follows that

$$\lim_{\epsilon\downarrow 0}\frac{dM}{d\epsilon} = +\infty. \tag{18}$$

This proves that $p^*$ does not satisfy the variational principle because for $\epsilon$ sufficiently small $M(\epsilon)$ is strictly larger than $M(0)$.

□



Theorem 1 Assume $f(0) = -\infty$. A probability distribution $p^*$ satisfies the variational principle if and only if there exist $\alpha$ and $\theta_1, \theta_2, \cdots, \theta_n$ such that (14) holds for all $a \in A$.

Proof

First assume that $p^*$ satisfies (14). This implies that $p^*_a > 0$ for all $a \in A$ because $f(0)$ is not defined. Hence, the divergence $D(p\|p^*)$ is well defined for all $p$. Next one calculates

$$\begin{aligned}
D(p\|p^*) &= I(p^*) - I(p) - \sum_{a\in A}(p_a - p^*_a) f(p^*_a)\\
&= I(p^*) - I(p) - \sum_{a\in A}(p_a - p^*_a)\Big[-\alpha - \sum_{j=1}^n\theta_j H_j(a)\Big]\\
&= I(p^*) - I(p) + \sum_{j=1}^n\theta_j\langle p - p^*, H_j\rangle.
\end{aligned}\tag{19}$$

Because $D(p\|p^*) \ge 0$ with equality if and only if $p = p^*$ there follows that $p^*$ satisfies the variational principle.

Next assume that $p^*$ satisfies the variational principle (12). From the lemma then follows that $p^*_a > 0$ for all $a \in A$. Hence, the divergence $D(p\|p^*)$ is well defined for all $p \in \mathcal{M}_1^+$. It follows from the variational principle that

$$\begin{aligned}
D(p\|p^*) &= I(p^*) - I(p) - \sum_{a\in A}(p_a - p^*_a) f(p^*_a)\\
&\ge \sum_{j=1}^n\theta_j\langle p^* - p, H_j\rangle - \sum_{a\in A}(p_a - p^*_a) f(p^*_a).
\end{aligned}\tag{20}$$

Now, the function $p \to D(p\|p^*)$ is convex with continuous derivatives. The r.h.s. of the above expression is affine. Both l.h.s. and r.h.s. vanish for $p = p^*$. One then concludes that the r.h.s. is tangent to the convex function and must be identically zero. One concludes that for all $p$

$$\sum_{a\in A}(p_a - p^*_a) f(p^*_a) = \sum_{j=1}^n\theta_j\langle p^* - p, H_j\rangle. \tag{21}$$

This implies that $f(p^*)$ is of the form (14); take $p_a = \delta_{a,b}$ for some fixed $b$ to see this.

□



6 The case with cutoff 

Assume now that $f(0) = \lim_{u\downarrow 0} f(u)$ converges. Then the divergence $D(p\|q)$ is well defined for any pair of probability distributions $p, q$.

Theorem 2 Assume that $f(0) = \lim_{u\downarrow 0} f(u)$ converges. The following are equivalent:

1. $p^*$ satisfies the variational principle;

2. there exist parameters $\alpha$ and $\theta_1, \theta_2, \cdots, \theta_n$, and a subset $A_0$ of $A$ such that

• (14) is satisfied for all $a \in A\setminus A_0$;

• $p^*_a = 0$ for all $a \in A_0$;

• $f(0) + \sum_{j=1}^n\theta_j H_j(a) \ge -\alpha$ for all $a \in A_0$.

Note that this last condition expresses that the r.h.s. of (14) is out of the range of $f(u)$ because it takes a value less than $f(0)$.
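
The three conditions of item 2 can be made concrete with a small numerical sketch. It assumes the entropy with h(u) = u(1-u), for which f(u) = 2u - 1 and f(0) = -1 is finite, together with a single hypothetical Hamiltonian `H` and parameter `theta`; the bisection on alpha is merely one convenient way to impose normalisation. Where the r.h.s. of (14) falls below f(0), the probability is set to zero, which produces the cutoff set A_0.

```python
import random

H = [0.0, 1.0, 2.0, 5.0]   # hypothetical Hamiltonian
theta = 1.0                # hypothetical parameter
f0 = -1.0                  # f(0) for f(u) = 2u - 1

def probs(alpha):
    """Invert (14) where possible, p_a = (1 - alpha - theta*H(a))/2, else cut off at 0."""
    return [max(0.0, (1.0 - alpha - theta * Ha) / 2.0) for Ha in H]

# Bisection on alpha: the total mass decreases monotonically with alpha.
lo, hi = -10.0, 1.0 - theta * min(H)
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if sum(probs(mid)) > 1.0:
        lo = mid
    else:
        hi = mid
alpha = 0.5 * (lo + hi)
p_star = probs(alpha)

# Cutoff set A_0 and the third condition of Theorem 2.
A0 = [a for a, pa in enumerate(p_star) if pa == 0.0]
third_condition = all(f0 + theta * H[a] >= -alpha for a in A0)

def functional(p):
    """I(p) - theta <p, H> with h(u) = u(1 - u)."""
    return sum(pa * (1.0 - pa) for pa in p) - theta * sum(pa * Ha for pa, Ha in zip(p, H))

def random_distribution():
    w = [random.random() for _ in H]
    s = sum(w)
    return [x / s for x in w]

random.seed(1)
worst_gap = min(functional(p_star) - functional(random_distribution()) for _ in range(1000))
```

For these assumed values the last two letters of the alphabet end up in A_0, and the sampled gap in the variational inequality (12) stays non-negative up to rounding.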

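Proof

1) implies 2) As in the proof of the previous Theorem, one shows that (20) holds for all $p$. But now one cannot conclude (21) because some of the $p^*_a$ may vanish so that $p^*$ lies in one of the faces of the simplex $\mathcal{M}_1^+$. But one can still derive (14) for all $a$ for which $p^*_a \neq 0$.

Assume now that $p^*_a = 0$ for some given $a \in A$. Let

$$p_b = (1-\epsilon)p^*_b + \epsilon\,\delta_{b,a}. \tag{22}$$

Then the l.h.s. of (20) becomes

$$\begin{aligned}
D(p\|p^*) &= \sum_{b\in A}\int_{(1-\epsilon)p^*_b}^{p^*_b} du\,[f(p^*_b) - f(u)] + \int_0^\epsilon du\,[f(u) - f(0)]\\
&\le \epsilon\sum_{b\in A}p^*_b\,[f(p^*_b) - f((1-\epsilon)p^*_b)] + \int_0^\epsilon du\, f(u) - \epsilon f(0)\\
&= O(\epsilon^2).
\end{aligned}\tag{23}$$

On the other hand, the r.h.s. of (20) becomes

$$\text{r.h.s.} = \epsilon\sum_{j=1}^n\theta_j\sum_{b\in A}p^*_b H_j(b) - \epsilon\sum_{j=1}^n\theta_j H_j(a) + \epsilon\sum_{b\in A}p^*_b f(p^*_b) - \epsilon f(0). \tag{24}$$

From the inequality (20) then follows, after dividing by $\epsilon$ and letting $\epsilon$ tend to 0,

$$0 \ge \sum_{j=1}^n\theta_j\langle p^*, H_j\rangle - \sum_{j=1}^n\theta_j H_j(a) + \sum_{b\in A}p^*_b f(p^*_b) - f(0). \tag{25}$$

This implies the desired inequality because

$$-\alpha = \sum_{b\in A}p^*_b f(p^*_b) + \sum_{j=1}^n\theta_j\langle p^*, H_j\rangle. \tag{26}$$

2) implies 1) One calculates

$$\begin{aligned}
I(p) - \sum_{j=1}^n\theta_j\langle p, H_j\rangle &= -D(p\|p^*) + I(p^*) - \sum_{a\in A}(p_a - p^*_a) f(p^*_a) - \sum_{j=1}^n\theta_j\langle p, H_j\rangle\\
&\le I(p^*) - f(0)\sum_{a\in A_0}p_a + \sum_{a\in A\setminus A_0}(p_a - p^*_a)\Big[\alpha + \sum_{j=1}^n\theta_j H_j(a)\Big] - \sum_{j=1}^n\theta_j\langle p, H_j\rangle\\
&= I(p^*) - \sum_{j=1}^n\theta_j\langle p^*, H_j\rangle - \sum_{a\in A_0}p_a\Big[f(0) + \alpha + \sum_{j=1}^n\theta_j H_j(a)\Big].
\end{aligned}\tag{27}$$

The variational principle now follows using the third assumption of the Theorem.

□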



7 Statistical models 

In the definition of the variational principle there is given a set of Hamiltonians $H_1(a), H_2(a), \cdots, H_n(a)$, this means, real functions over the alphabet $A$, bounded from below. The equilibrium distribution $p^*$ is then characterised by a normalisation constant $\alpha$, by parameters $\theta_1, \theta_2, \cdots, \theta_n$, and by a subset $A_0$ of the alphabet $A$; see (14). The emphasis now shifts towards these parameters.

Theorem 3 Let be given Hamiltonians $H_1(a), H_2(a), \cdots, H_n(a)$. For each $\theta$ in $\mathbb{R}^n$ there exists at most one probability distribution $p^*$ satisfying the variational principle (12) with these parameters $\theta$.

Proof

If $p^*$ and $q^*$ both satisfy the variational principle (12) with the same parameters $\theta$ then also the convex combination $r^* = \frac12 p^* + \frac12 q^*$ has the same property because the entropy function is concave. But then one can conclude from the inequalities (12) that $I(r^*) = \frac12 I(p^*) + \frac12 I(q^*)$. Because the entropy function is strictly concave there follows $p^* = q^*$.

□

The set of $\theta$ for which a $p^*$ exists, satisfying the variational principle (12), is denoted $\mathcal{D}$. The probability distribution is denoted $p_\theta$ instead of $p^*$. The constant $\alpha$ appearing in (13) is replaced by $\alpha(\theta)$.

A statistical model is a parametrised set of probability distributions. The above Theorem implies that the set $(p_\theta)_{\theta\in\mathcal{D}}$ of probability distributions satisfying the variational principle is a statistical model. One can say that such a model belongs to the generalised exponential family.

Definition 3 Let be given a generalised entropy function $I(p)$ of the form (4). A statistical model $(p_\theta)_{\theta\in\mathcal{D}}$ belongs to the generalised exponential family if there exist real functions $H_1(a), H_2(a), \cdots, H_n(a)$, bounded from below, such that each member $p_\theta$ of the model satisfies the variational principle (12) with these Hamiltonians and with this set of parameters.

Clearly, entropy functions which differ only by a scalar factor determine the same generalised exponential family.

8 Uniqueness theorem 

Let us now turn to the question whether a given model $(p_\theta)_{\theta\in\mathcal{D}}$ can belong to two different generalised exponential families.

Theorem 4 Let be given a model $(p_\theta)_{\theta\in\mathcal{D}}$. Assume that there exists an open subset $\mathcal{D}_0$ of $\mathcal{D}$ with the property that the set of values of $p_{\theta,a}$, with $\theta \in \mathcal{D}_0$ and $a \in A$, covers the open interval $(0,1)$:

$$\{p_{\theta,a} : \theta\in\mathcal{D}_0,\ a\in A\} \supset (0,1). \tag{28}$$

If the model belongs to two different generalised exponential families, one with entropy function $I_1(p)$, the other with entropy function $I_2(p)$, then there exists a constant $\lambda$ such that $I_2(p) = \lambda I_1(p)$ for all $p$.

Proof

Take any point $u$ in $(0,1)$ and a corresponding $\theta \in \mathcal{D}_0$ and $a$ such that $p_{\theta,a} = u$. From the previous theorems follows that there exist functions $\alpha_i(\theta)$ and Hamiltonians $H_{i1}(a), H_{i2}(a), \cdots, H_{in}(a)$, with $i = 1,2$, such that

$$f_{i,a}(p_{\theta,a}) = -\alpha_i(\theta) - \sum_{j=1}^n\theta_j H_{ij}(a). \tag{29}$$

Let $F_a = f_{2,a}\circ f_{1,a}^{-1}$. Note that this is a strictly increasing continuous function. Then one has

$$F_a\Big(-\alpha_1(\theta) - \sum_{j=1}^n\theta_j H_{1j}(a)\Big) = -\alpha_2(\theta) - \sum_{j=1}^n\theta_j H_{2j}(a). \tag{30}$$

This relation holds also on a vicinity of $\theta \in \mathcal{D}_0$. It therefore implies the existence of $\lambda_a$ and $K_{ij}$ such that

$$H_{2j}(a) - K_{2j} = \lambda_a\,(H_{1j}(a) - K_{1j}),\qquad j = 1,2,\cdots,n. \tag{31}$$

Then one can rewrite (30) as

$$F_a(v) = \gamma_a(\theta) + \lambda_a v, \tag{32}$$

with

$$\gamma_a(\theta) = -\alpha_2(\theta) - \sum_{j=1}^n\theta_j K_{2j} + \lambda_a\Big[\alpha_1(\theta) + \sum_{j=1}^n\theta_j K_{1j}\Big], \tag{33}$$

valid for some neighbourhood of the given $\theta$. Using the definition of $F_a(v)$ one obtains

$$f_{2,a}(u) = \gamma_a(\theta) + \lambda_a f_{1,a}(u), \tag{34}$$

valid on some neighbourhood of the given $u \in (0,1)$. Because $u$ is arbitrary and the functions $f_{i,a}$ are continuous, the same expression must hold on all of $(0,1]$. From $0 = h_{i,a}(0) = \int_0^1 du\, f_{i,a}(u)$ now follows that $\gamma_a(\theta) = 0$. Therefore (33) becomes

$$\lambda_a = \frac{\alpha_2(\theta) + \sum_{j=1}^n\theta_j K_{2j}}{\alpha_1(\theta) + \sum_{j=1}^n\theta_j K_{1j}}. \tag{35}$$

In particular, $\lambda_a$ does not depend on $a \in A$. One concludes therefore that there exists $\lambda$ so that $f_{2,a}(u) = \lambda f_{1,a}(u)$. This implies $I_2(p) = \lambda I_1(p)$.

□



9 Thermodynamics 

Throughout this Section, let be given a statistical model $(p_\theta)_{\theta\in\mathcal{D}}$ belonging to the generalised exponential family.

Note that if $p_\theta$ and $p_\eta$ both belong to the same set $\mathcal{P}_U$ then they satisfy $I(p_\theta) = I(p_\eta)$. Hence, a function $S(U)$ can be defined by

$$S(U) = I(p_\theta) \quad\text{whenever } \langle p_\theta, H_j\rangle = U_j \text{ for } j = 1,2,\cdots,n. \tag{36}$$

This function is called the thermodynamic entropy. The concept of thermodynamic entropy was first introduced by Clausius around 1850. The Legendre transform of the thermodynamic entropy is given by

$$\Phi(\theta) = \sup_U\Big\{S(U) - \sum_{j=1}^n\theta_j U_j\Big\}. \tag{37}$$

This function was introduced by Massieu in 1869. The supremum is taken over all $U$ for which $S(U)$ is defined by (36). The function $\Phi(\theta)$ is convex; this is a well-known property of Legendre transforms.
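Proposition 1 One has

$$\Phi(\theta) = I(p_\theta) - \sum_{j=1}^n\theta_j\langle p_\theta, H_j\rangle,\qquad \theta\in\mathcal{D}. \tag{38}$$

Proof

Given $\theta \in \mathcal{D}$ there exists $p_\theta$ for which the variational principle holds. Then one has, with $U_j = \langle p_\theta, H_j\rangle$,

$$\Phi(\theta) \ge S(U) - \sum_{j=1}^n\theta_j U_j = I(p_\theta) - \sum_{j=1}^n\theta_j\langle p_\theta, H_j\rangle. \tag{39}$$

This proves the inequality in one direction. Next, fix $\epsilon > 0$ and let $U$ be such that

$$\Phi(\theta) \le S(U) - \sum_{j=1}^n\theta_j U_j + \epsilon, \tag{40}$$

with $U$ such that $S(U)$ is defined by (36). Then, there follows from the definition of $S(U)$ that $\eta \in \mathcal{D}$ exists such that $S(U) = I(p_\eta)$ with $\langle p_\eta, H_j\rangle = U_j$, $j = 1,2,\cdots,n$. The variational principle now implies that

$$\begin{aligned}
I(p_\theta) - \sum_{j=1}^n\theta_j\langle p_\theta, H_j\rangle &\ge I(p_\eta) - \sum_{j=1}^n\theta_j\langle p_\eta, H_j\rangle\\
&= S(U) - \sum_{j=1}^n\theta_j U_j\\
&\ge \Phi(\theta) - \epsilon.
\end{aligned}\tag{41}$$

Because $\epsilon > 0$ is arbitrary, the inequality in the other direction follows now.

□

The inverse Legendre transformation reads

$$\overline{S}(U) = \inf_\theta\Big\{\Phi(\theta) + \sum_{j=1}^n\theta_j U_j\Big\}. \tag{42}$$

It is a concave function.

Proposition 2 One has $\overline{S}(U) = S(U)$ for all $U$ for which $S(U)$ is defined by (36).

Proof

From the definition of the Massieu function there follows that

$$\Phi(\theta) \ge S(U) - \sum_{j=1}^n\theta_j U_j \quad\text{for all } \theta\in\mathbb{R}^n. \tag{43}$$

This implies that $S(U) \le \overline{S}(U)$. On the other hand, from the definition (36) of $S(U)$ follows that

$$S(U) = \Phi(\theta) + \sum_{j=1}^n\theta_j U_j, \tag{44}$$

where $\theta$ is such that $p_\theta \in \mathcal{P}_U$. This implies $S(U) \ge \overline{S}(U)$. The two inequalities together establish the desired equality.

□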



10 Thermodynamic relations 

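Like in the previous Section, there is given a statistical model $(p_\theta)_{\theta\in\mathcal{D}}$ belonging to the generalised exponential family. In addition, let $\mathcal{D}_0$ be an open subset of $\mathcal{D}$ on which the map $\theta \to \langle p_\theta, H_j\rangle$ is continuous.

The following results are typical properties of Legendre transforms. For completeness, proofs are given.

Proposition 3 The first derivative of the Massieu function $\Phi(\theta)$ exists for $\theta$ in $\mathcal{D}_0$. It satisfies

$$\frac{\partial\Phi}{\partial\theta_j} = -\langle p_\theta, H_j\rangle,\qquad \theta\in\mathcal{D}_0. \tag{45}$$

Proof

From the definitions one has for $\theta$ and $\theta+\eta$ in $\mathcal{D}_0$

$$\begin{aligned}
\Phi(\theta+\eta) &= I(p_{\theta+\eta}) - \sum_{j=1}^n(\theta_j+\eta_j)\langle p_{\theta+\eta}, H_j\rangle\\
&\le I(p_\theta) - \sum_{j=1}^n\theta_j\langle p_\theta, H_j\rangle - \sum_{j=1}^n\eta_j\langle p_{\theta+\eta}, H_j\rangle\\
&= \Phi(\theta) - \sum_{j=1}^n\eta_j\langle p_{\theta+\eta}, H_j\rangle
\end{aligned}\tag{46}$$

and

$$\begin{aligned}
\Phi(\theta) &= I(p_\theta) - \sum_{j=1}^n\theta_j\langle p_\theta, H_j\rangle\\
&\le I(p_{\theta+\eta}) - \sum_{j=1}^n(\theta_j+\eta_j)\langle p_{\theta+\eta}, H_j\rangle + \sum_{j=1}^n\eta_j\langle p_\theta, H_j\rangle\\
&= \Phi(\theta+\eta) + \sum_{j=1}^n\eta_j\langle p_\theta, H_j\rangle.
\end{aligned}\tag{47}$$

Expression (45) now follows using the continuity of the map $\theta \to \langle p_\theta, H_j\rangle$.

□

Introduce the metric tensor

$$g_{ij}(\theta) = \frac{\partial^2\Phi}{\partial\theta_i\,\partial\theta_j}. \tag{48}$$

Because the Massieu function is convex the matrix $g(\theta)$ is positive definite, whenever it exists. By the previous Proposition one has

$$g_{ij}(\theta) = -\frac{\partial}{\partial\theta_i}\langle p_\theta, H_j\rangle \tag{49}$$

for those $\theta$ in $\mathcal{D}_0$ for which the derivative exists.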
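
Relations (45) and (49) lend themselves to a finite-difference check. The sketch below assumes the Boltzmann-Gibbs-Shannon entropy with one Hamiltonian, so that p_theta is the Gibbs distribution; the energy levels, the value of `theta` and the step `eps` are illustrative assumptions. In this standard case the metric tensor should coincide with the variance of H under p_theta.

```python
import math

H = [0.0, 1.0, 2.5, 4.0]  # hypothetical energy levels

def gibbs(theta):
    """Equilibrium distribution p_theta for the Boltzmann-Gibbs-Shannon entropy."""
    w = [math.exp(-theta * Ha) for Ha in H]
    Z = sum(w)
    return [x / Z for x in w]

def mean_H(theta):
    """Expectation <p_theta, H>."""
    return sum(pa * Ha for pa, Ha in zip(gibbs(theta), H))

def massieu(theta):
    """Phi(theta) = I(p_theta) - theta <p_theta, H>, eq. (38)."""
    p = gibbs(theta)
    ent = -sum(pa * math.log(pa) for pa in p)
    return ent - theta * mean_H(theta)

theta, eps = 0.7, 1e-5

# Central differences approximating the derivatives in (45) and (49).
dPhi = (massieu(theta + eps) - massieu(theta - eps)) / (2 * eps)
g = -(mean_H(theta + eps) - mean_H(theta - eps)) / (2 * eps)

p = gibbs(theta)
variance = sum(pa * Ha * Ha for pa, Ha in zip(p, H)) - mean_H(theta) ** 2
```

Here `dPhi` should be close to `-mean_H(theta)`, and `g` should be positive and close to `variance`.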

In thermodynamics, the derivative of $S(U)$ equals the inverse of the absolute temperature $T$. Here, the analogous property becomes

Proposition 4 Let $\theta \in \mathcal{D}_0$ and define $U$ by $U_j = \langle p_\theta, H_j\rangle$. Then one has

$$\frac{\partial S}{\partial U_j} = \theta_j,\qquad j = 1,2,\cdots,n. \tag{50}$$

Proof

On a vicinity of $\theta$ one has $S(U) = \Phi(\theta) + \sum_{j=1}^n\theta_j U_j$. Hence, one can write

$$\frac{\partial S}{\partial U_j} = \sum_{k=1}^n\Big(\frac{\partial\Phi}{\partial\theta_k} + U_k\Big)\frac{\partial\theta_k}{\partial U_j} + \theta_j. \tag{51}$$

But the first term in the r.h.s. vanishes because the previous Proposition holds. Hence, the desired result follows.

□

The two relations (45) and (50) are dual in the sense of Amari [1]. In thermodynamics, the entropy $S(U)$ and Massieu's function $\Phi(\theta)$ are state functions, the energies $U_j$ are extensive thermodynamic variables, the parameters $\theta_j$ are the intensive thermodynamic variables.
11 Escort probabilities

Let us now make the additional assumption that the function $f(u)$, which enters the definition (6) of the generalised entropy, has a derivative $f'(u)$. Because $f(u)$ was supposed to be strictly increasing, one can write

$$f(u) = f(1) - \int_u^1 dv\,\frac{1}{\phi(v)},\qquad u\in(0,1], \tag{52}$$

where $\phi(v) = 1/(df/dv)$ is a strictly positive function.

As before, there is given a statistical model $(p_\theta)_{\theta\in\mathcal{D}}$ belonging to the generalised exponential family, and $\mathcal{D}_0$ is an open subset of $\mathcal{D}$ on which the map $\theta \to \langle p_\theta, H_j\rangle$ is continuous. The set $A_0(\theta)$ is the set of $a \in A$ for which $p_\theta(a) = 0$. From Theorems 1 and 2 now follows

$$\frac{\partial}{\partial\theta_j}p_{\theta,a} = \phi(p_{\theta,a})\Big(-\frac{\partial\alpha}{\partial\theta_j} - H_j(a)\Big),\qquad \theta\in\mathcal{D}_0,\ a\in A\setminus A_0(\theta). \tag{53}$$

This expression was used in [8] as a condition under which a generalisation of the well-known bound of Cramér and Rao is optimal. An immediate consequence of (53) is

Proposition 5 Assume the regularity condition

$$0 = \sum_a\frac{\partial}{\partial\theta_j}p_\theta(a). \tag{54}$$

Assume in addition that

$$z(\theta) = {\sum_a}'\phi(p_{\theta,a}) < +\infty, \tag{55}$$

where $\sum'$ denotes the sum over all $a \in A\setminus A_0(\theta)$. Then one has

$$\frac{\partial\alpha}{\partial\theta_j} = -\frac{1}{z(\theta)}{\sum_a}'\phi(p_{\theta,a})H_j(a). \tag{56}$$

Proof

On a vicinity of the given $\theta$ one has (53). Hence, by summing (53) over $a \in A\setminus A_0(\theta)$ one obtains using (54)

$$0 = {\sum_a}'\phi(p_{\theta,a})\Big(-\frac{\partial\alpha}{\partial\theta_j} - H_j(a)\Big),\qquad \theta\in\mathcal{D}_0. \tag{57}$$

□

The probability distribution

$$P_{\theta,a} = \begin{cases}\dfrac{1}{z(\theta)}\phi(p_{\theta,a}), & p_{\theta,a}\neq 0,\\[1ex] 0, & \text{otherwise},\end{cases}\tag{58}$$

when it exists, is called the escort of the exponential family $(p_\theta)_{\theta\in\mathcal{D}}$. With this notation, one can write the result of the Proposition as

$$\frac{\partial\alpha}{\partial\theta_j} = -\langle P_\theta, H_j\rangle. \tag{59}$$

12 Generalised Fisher information 

Let be given a model $(p_\theta)_{\theta\in\mathcal{D}}$ for which $z(\theta)$, as given by (55), converges. The escort probabilities $P_{\theta,a}$ are defined by (58). Then one can define a generalised Fisher information matrix by

$$I_{ij}(\theta) = \langle P_\theta, X_i(\theta)X_j(\theta)\rangle, \tag{60}$$

where the score variables are defined by

$$X_{i,a}(\theta) = \frac{1}{P_{\theta,a}}\frac{\partial}{\partial\theta_i}p_{\theta,a}. \tag{61}$$

Note that in the standard case of $h(u) = -u\ln u$ one has $\phi(u) = u$ so that the escort probabilities $P_\theta$ coincide with the $p_\theta$. Then (60) reduces to the conventional definition.

Fix now a set of Hamiltonians $H_1(a), H_2(a), \cdots, H_n(a)$. Then one can define a covariance matrix $\sigma(\theta)$ by

$$\sigma_{ij}(\theta) = \langle P_\theta, H_iH_j\rangle - \langle P_\theta, H_i\rangle\langle P_\theta, H_j\rangle. \tag{62}$$

Proposition 6 Assume a finite alphabet $A$. Then one has

$$I_{ij}(\theta) = z(\theta)\,g_{ij}(\theta) = z^2(\theta)\,\sigma_{ij}(\theta). \tag{63}$$

Proof

From (53) follows

$$X_{j,a}(\theta) = z(\theta)\Big(-\frac{\partial\alpha}{\partial\theta_j} - H_j(a)\Big) \tag{64}$$

for all $\theta \in \mathcal{D}_0$ and $a \in A\setminus A_0(\theta)$. Hence, the Fisher information matrix becomes

$$I_{ij}(\theta) = z^2(\theta)\sum_{a\in A}P_{\theta,a}\Big(\frac{\partial\alpha}{\partial\theta_i} + H_i(a)\Big)\Big(\frac{\partial\alpha}{\partial\theta_j} + H_j(a)\Big). \tag{65}$$

Using (59) there follows $I_{ij}(\theta) = z^2(\theta)\sigma_{ij}(\theta)$.

On the other hand, from (49) and (53) there follows

$$\begin{aligned}
g_{ij}(\theta) &= -\frac{\partial}{\partial\theta_i}\sum_{a\in A}p_{\theta,a}H_j(a)\\
&= -\sum_{a\in A}\phi(p_{\theta,a})\Big(-\frac{\partial\alpha}{\partial\theta_i} - H_i(a)\Big)H_j(a).
\end{aligned}\tag{66}$$

Using (56) there follows $g_{ij}(\theta) = z(\theta)\sigma_{ij}(\theta)$.

□

The assumption of a finite alphabet is made to ensure that the conditions of Proposition 5 are fulfilled and that the sum and derivative may be interchanged in (66). The generalised inequality of Cramér and Rao, in the present notations, reads [8]

$$\Big[\sum_{k,l}\sigma_{kl}u^ku^l\Big]\Big[\sum_{k,l}I_{kl}v^kv^l\Big] \ge \Big[\sum_{k,l}g_{kl}u^kv^l\Big]^2, \tag{67}$$

with $u$ and $v$ arbitrary real vectors. The previous Proposition then implies that the inequality becomes an equality when $u = v$, when $P$ is related to $p$ via (58), and when $p_\theta$ belongs to a generalised exponential family.
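
The chain of equalities (63) can be probed numerically on a small example. The sketch below is only an illustration under stated assumptions: a one-parameter family obtained by solving (14) for f(u) = (q u^(q-1) - 1)/(q-1) with q = 1/2 (so that f(0) = -infinity and no cutoff occurs), a hypothetical three-level Hamiltonian, bisection for alpha(theta), and central differences for the derivatives with respect to theta. None of these numerical choices come from the text.

```python
q = 0.5
H = [0.0, 1.0, 2.0]   # hypothetical Hamiltonian

def phi(u):
    """phi(u) = 1/f'(u) = u**(2-q)/q for f(u) = (q*u**(q-1) - 1)/(q-1)."""
    return u ** (2.0 - q) / q

def family(theta):
    """Solve (14) for p_theta; alpha is fixed by normalisation via bisection."""
    def probs(alpha):
        return [((1.0 + (1.0 - q) * (alpha + theta * Ha)) / q) ** (1.0 / (q - 1.0))
                for Ha in H]
    # Just above the value of alpha at which one of the brackets would vanish.
    lo = -1.0 / (1.0 - q) - min(theta * Ha for Ha in H) + 1e-9
    hi = 100.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if sum(probs(mid)) > 1.0:   # total mass decreases with alpha
            lo = mid
        else:
            hi = mid
    return probs(0.5 * (lo + hi))

theta, eps = 0.5, 1e-5
p = family(theta)
dp = [(u - v) / (2 * eps) for u, v in zip(family(theta + eps), family(theta - eps))]

z = sum(phi(pa) for pa in p)                      # eq. (55)
P = [phi(pa) / z for pa in p]                     # escort probabilities, eq. (58)
mean = sum(Pa * Ha for Pa, Ha in zip(P, H))
sigma = sum(Pa * Ha * Ha for Pa, Ha in zip(P, H)) - mean ** 2   # eq. (62)
g = -sum(d * Ha for d, Ha in zip(dp, H))          # eq. (49)
fisher = sum(Pa * (d / Pa) ** 2 for Pa, d in zip(P, dp))        # eqs. (60), (61)
```

Up to finite-difference error, `fisher`, `z * g` and `z * z * sigma` should agree, and `sum(dp)` should vanish as required by (54).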



13 Non-extensive thermostatistics 

Define the q-deformed logarithm by [15, 6]

$$\ln_q(u) = \frac{1}{1-q}\big(u^{1-q} - 1\big). \tag{68}$$

It is a strictly increasing function, defined for $u > 0$. Indeed, its derivative equals

$$\frac{d}{du}\ln_q(u) = \frac{1}{u^q} > 0. \tag{69}$$

In the limit $q = 1$ the q-deformed logarithm converges to the natural logarithm $\ln u$.

The deformed logarithm can be used in more than one way to define an entropy function. The q-entropy (3) can be written as

$$I_q(p) = \sum_a p_a\ln_q\Big(\frac{1}{p_a}\Big). \tag{70}$$

Comparison with (4) gives

$$h(u) = \frac{u}{1-q}\big(u^{q-1} - 1\big) = u\ln_q\Big(\frac1u\Big). \tag{71}$$

One has $h(0) = h(1) = 0$. Taking the derivative gives

$$f(u) = -\frac{dh}{du} = \frac{1}{q-1}\big(q u^{q-1} - 1\big). \tag{72}$$

It is a strictly increasing function on $(0,1]$ when $q > 0$. The function $\phi(u)$ is given by

$$\phi(u) = \frac1q u^{2-q}. \tag{73}$$

The probability distributions belonging to the generalised exponential family, corresponding with (70), are

$$p_a = q^{1/(1-q)}\Big[1 - (q-1)\alpha - (q-1)\sum_{j=1}^n\theta_j H_j(a)\Big]_+^{1/(q-1)}, \tag{74}$$

with $[u]_+ = \max\{0,u\}$. This is indeed the kind of probability distribution discussed in the original paper of Tsallis [14]. However, more often used is the alternative of [17]. In the latter paper the concept of escort probability distributions was introduced into the literature. They were defined by

$$P_a = \frac{1}{\sum_b p_b^q}\,p_a^q, \tag{75}$$

which in the present notations corresponds with $\phi(u)$ proportional to $u^q$. This can be obtained by replacing the constant $q$ by $2-q$ in (70). The entropy function then reads

$$I(p) = -\sum_a p_a\ln_q(p_a), \tag{76}$$

which is not the expression that one would write down based on the information theoretical argument that $\ln(1/p_a)$ is the amount of information (counted in units of $\ln 2$), gained from an event occurring with probability $p_a$. Note that with this definition of entropy function the condition $q < 2$ is needed in order to satisfy the requirement that the function $f(u) = \frac{d}{du}\big(u\ln_q(u)\big)$ is an increasing function.
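
The definitions (68) to (70) are simple enough to check directly. The following minimal Python sketch is an illustration only; the helper names, the test values of u and q, and the test distribution are assumptions. It compares a finite-difference slope with (69), evaluates the deformed logarithm close to q = 1, and compares (70) with the closed form of (3), namely (sum_a p_a^q - 1)/(1 - q).

```python
import math

def ln_q(u, q):
    """q-deformed logarithm, eq. (68); reduces to math.log for q = 1."""
    if q == 1.0:
        return math.log(u)
    return (u ** (1.0 - q) - 1.0) / (1.0 - q)

def q_entropy(p, q):
    """I_q(p) = sum_a p_a ln_q(1/p_a), eq. (70); zero probabilities are skipped."""
    return sum(pa * ln_q(1.0 / pa, q) for pa in p if pa > 0.0)

u, q, eps = 2.0, 0.5, 1e-6
slope = (ln_q(u + eps, q) - ln_q(u - eps, q)) / (2 * eps)  # compare with u**(-q)
near_one = ln_q(u, 1.0 + 1e-6)                             # compare with ln 2

p = [0.5, 0.3, 0.2]                                        # assumed test distribution
closed_form = (sum(pa ** q for pa in p) - 1.0) / (1.0 - q)
shannon = -sum(pa * math.log(pa) for pa in p)
```

With q set to exactly 1 the function `q_entropy` returns the Boltzmann-Gibbs-Shannon entropy (2), here stored in `shannon` for comparison.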



14 The percolation problem 

This example has been treated in [5]. It is a genuine example of an important model of statistical physics which does not belong to the exponential family. In addition, it is an example which fits into the present generalised context provided that one allows that the function $h(u)$ appearing in the definition (4) of the generalised entropy function is stochastic.

In the site percolation problem [13], the points of a lattice are occupied with probability $q$, independent of each other. The point at the origin is either unoccupied, with probability $p_0$, or it belongs to a cluster of shape $i$, with probability $p_i$. This cluster is finite with probability 1, provided that $0 < q < q_c$, where $q_c$ is the percolation threshold. The probability $p_\infty$ that the origin belongs to an infinite cluster is strictly positive for $q > q_c$. However, for the sake of simplicity of the presentation, $0 < q < q_c$ will be assumed; see [9] for the general case. These probabilities are given by

$$p_i = c_i\,q^{s(i)}(1-q)^{t(i)}, \tag{77}$$

where $c_i$ is the number of different clusters of shape $i$, $s(i)$ is the number of occupied sites in the cluster, and $t(i)$ is the number of perimeter sites, that is, of unoccupied neighbouring sites. Note that (77) also holds when the origin is not occupied, provided that one adopts the convention that $c(0) = 1$, $s(0) = 0$ and $t(0) = 1$.

Choose the Hamiltonian

$$H(i) = \frac{s(i)}{t(i) + s(i)} \tag{78}$$

and introduce the parameter $\theta$ by

$$\theta = \ln\frac{1-q}{q},\qquad q = \frac{1}{1+e^\theta}. \tag{79}$$

Then one can write

$$\ln\frac{p_i}{c_i} = \big[-\alpha(\theta) - \theta H(i)\big]\,\big[s(i) + t(i)\big], \tag{80}$$

with

$$\alpha(\theta) = \ln\big(1 + e^{-\theta}\big). \tag{81}$$

This looks like an exponential family, except for the extra factor $[s(i) + t(i)]$ in the r.h.s. Introduce the stochastic function

$$f_i(u) = \frac{1}{s(i) + t(i)}\ln\frac{u}{c_i}. \tag{82}$$

Then the above expression is of the form (14). By integrating $f_i(u)$ one obtains

$$h_i(u) = -\frac{u}{s(i) + t(i)}\Big[\ln\frac{u}{c_i} - 1\Big]. \tag{83}$$

It is now straightforward to verify that the percolation problem belongs to a generalised exponential family. The relevant entropy function for the percolation model in the non-percolating region $0 < q < q_c$ is therefore

$$I(p) = -\sum_i\frac{p_i}{s(i) + t(i)}\Big[\ln\frac{p_i}{c_i} - 1\Big]. \tag{84}$$
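
The rewriting of (77) into the form (80) can be verified numerically for a few cluster shapes. In the sketch below the triples (c, s, t) are hypothetical illustrative values, loosely modelled on the smallest clusters of a square lattice, and the occupation probability `q` is an arbitrary assumption; only the algebraic identity between (77) and (80), with (78), (79) and (81), is being tested.

```python
import math

q = 0.3                                 # assumed occupation probability
theta = math.log((1.0 - q) / q)         # eq. (79)
alpha = math.log(1.0 + math.exp(-theta))  # eq. (81)

# Hypothetical (c_i, s(i), t(i)) triples: empty origin, single site, two-site cluster.
clusters = [(1, 0, 1), (1, 1, 4), (4, 2, 6)]

residuals = []
for c, s, t in clusters:
    p_i = c * q ** s * (1.0 - q) ** t           # eq. (77)
    H_i = s / (s + t)                           # eq. (78)
    lhs = math.log(p_i / c)
    rhs = (-alpha - theta * H_i) * (s + t)      # eq. (80)
    residuals.append(abs(lhs - rhs))

q_back = 1.0 / (1.0 + math.exp(theta))          # inverse relation in (79)
```

All entries of `residuals` should vanish up to rounding, and `q_back` should reproduce `q`.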
15 Discussion

Sections 3 to 6 of the present paper discuss the variational principle, which is stronger than the maximum entropy principle. It is shown that the method of Lagrange multipliers leads to the correct result, even in the context of generalised entropy functions. The difficulty that arises is known as the cutoff problem: the optimising probability distribution may assign vanishing probabilities to some of the events. To cope with this situation the two cases have been considered separately. Theorem 1 treats the standard case, Theorem 2 copes with the vanishing probabilities.

In Section 7, a generalised definition of an exponential family is given. It identifies the members of the generalised exponential family with the solutions of the variational principle, given a generalised entropy function of the usual form (4). The definition of the standard exponential family corresponds of course with the Boltzmann-Gibbs-Shannon entropy. Entropy functions $I(p)$ and $\lambda I(p)$, with $\lambda > 0$, determine the same exponential family. Assuming some technical condition, the intersection of different generalised exponential families is empty; see Theorem 4. As a consequence, a one-to-one relation has been established between generalised exponential families and classes of equivalent entropy functions.

In [8], the notion of phi-exponential family was introduced. The 'phi' in this name refers to the function $\phi(v)$, introduced in (52). It is one divided by the derivative of the function $f(v)$ appearing in the expression (6) for the entropy function $I(p)$. The assumption that the derivative of $f(v)$ exists for all $v > 0$ has been eliminated in the present paper. More important is that the definition of a generalised exponential family is now given directly in terms of the entropy function $I(p)$, via the variational principle, without relying on the notion of deformed exponential functions.

Sections 9 to 12 discuss the geometric properties of a generalised exponential family, using a terminology coming from 150-year-old thermodynamics. The main result is (63), proving the equality of the three quantities generalised Fisher information, metric tensor times partition sum $z(\theta)$, and covariance matrix multiplied with $z^2(\theta)$. The covariance matrix is calculated using the escort family of probability distributions.

Many applications of generalised exponential families are found in the literature, in the context of nonextensive thermostatistics. The latter has been discussed in Section 13. A completely different kind of example is found in percolation theory; see Section 14. It illustrates the possibility that the function $f(u)$, which determines the entropy function $I(p)$ via (6), is of a stochastic nature. One can expect that many other applications will be found in the near future.

Finally note that the generalisation of the present work to quantum probabilities is straightforward. Let be given a strictly increasing function $f(u)$, continuous on $(0,1]$. The expression (6) can be generalised to

$$I(\rho) = -\int_0^1 dv\,\mathrm{Tr}\,\rho f(v\rho), \tag{85}$$

where $\rho$ is any density operator in a Hilbert space. The Bregman divergence (8) generalises to

$$D(\rho\|\rho') = I(\rho') - I(\rho) - \mathrm{Tr}\,(\rho - \rho')f(\rho'). \tag{86}$$

The basic inequality $D(\rho\|\rho') \ge 0$ is proved using Klein's inequality; see 2.5.2. of [12].